“Web scraping is the process of automatically mining data or collecting information from the World Wide Web.” – Wikipedia
Web scraping is a flexible method to extract data from the internet. It can involve extracting numerical or text data.
There are many uses for web scraping, including but not limited to:
Always ensure - PRIOR to scraping - that you have rights to scrape the website.
This is critical as you can be blocked from sites or even face legal action.
Good news! You can easily check with the robotstxt package.
This example shows that Netflix does not allow you to scrape their site.
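As a minimal offline sketch of how such a check works, the block below parses a made-up robots.txt with the robotstxt package and asks whether two paths may be crawled. The rules text and the paths are hypothetical and are not taken from the talk. Against a live site you would call `paths_allowed()` with the real URL, which downloads that site's robots.txt first.

```r
library(robotstxt)

# Hypothetical robots.txt content, for illustration only
rules <- "User-agent: *\nDisallow: /private/\n"

# Build a robotstxt object from text instead of downloading it
rt <- robotstxt(text = rules)

# check() reports whether the given bot may fetch the path
allowed_public  <- rt$check(paths = "/books/", bot = "*")
allowed_private <- rt$check(paths = "/private/data", bot = "*")

# Live check (needs internet access):
# paths_allowed(paths = "https://www.netflix.com/")
```

A robots.txt file states the site's crawling preferences only; the site's terms of service may add further restrictions.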
“HTML is the standard markup language for creating Web pages.” – W3Schools
“CSS describes how HTML elements are to be displayed on screen, paper, or in other media.” – W3Schools
Image credit: Professor Shawn Santo
HTML is structured with “tags.” These tags mark portions of the page, and you can select content by referring to them.
There are many types of tags - here are some important ones for scraping:
<h1> - header tags
<p> - paragraph elements
<ul> - unordered bulleted list
<ol> - ordered list
<li> - individual list item
<div> - division
<table> - table
If you aren’t familiar with CSS, extracting parts of a website can be daunting.
SelectorGadget is incredibly helpful for this purpose. However, it is only available for Chrome.
Another option is to inspect the page elements with the developer tools built into most major browsers, including Chrome and Firefox.
CSS - syntax is easier and aligns with HTML tags
XPATH - useful when the node isn’t uniquely identified with CSS
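To make the comparison concrete, here is a small sketch that selects the same nodes both ways with rvest. The HTML is a made-up page defined inline, so nothing is downloaded, and the class name `title` is an assumption for illustration.

```r
library(rvest)

# Small made-up page standing in for a real site
page <- read_html('
  <html><body>
    <h1>R Books</h1>
    <ul>
      <li class="title">Advanced R</li>
      <li class="title">R Cookbook</li>
      <li>Not a title</li>
    </ul>
  </body></html>')

# CSS selector: list items with class "title"
css_titles <- page %>% html_nodes(css = "li.title") %>% html_text()

# Equivalent XPath expression
xpath_titles <- page %>% html_nodes(xpath = "//li[@class='title']") %>% html_text()

# XPath can also match on text content, which CSS selectors cannot do
by_text <- page %>%
  html_nodes(xpath = "//li[contains(text(), 'Cookbook')]") %>%
  html_text()
```

Both selectors return the two title items here; the last query shows the kind of case where XPath is the more practical choice.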
Set up the environment to scrape the site.
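The setup code from the slides is not reproduced in this text. A plausible minimal setup, assuming only the packages whose functions appear later in the talk, might look like this:

```r
# Packages assumed for this walkthrough (the original list is not shown)
library(rvest)      # read_html(), html_nodes(), html_text()
library(stringr)    # str_trim() and other string helpers
library(tibble)     # tibble() for the final data frame
library(robotstxt)  # check scraping permissions
```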
That’s it! These are all the tools you’ll need.
It only seems appropriate to pull data from Amazon regarding R books.
Ensure we can scrape the site.
We are good to scrape!
Before you can get started, you must specify the URLs to pass to the function.
Data as of 2020-06-29.
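The URL used in the talk is not shown, so the sketch below uses a placeholder address. It assumes the page object called `amazon` in the following code was created with rvest's `read_html()`; the helper name `fetch_page` is hypothetical.

```r
library(rvest)

# Placeholder address; the search URL used in the talk is not shown
url <- "https://www.example.com/s?k=R+programming"

# read_html() downloads and parses the page when given a URL.
# Wrapped in a function so nothing is fetched until it is called.
fetch_page <- function(u) read_html(u)

# amazon <- fetch_page(url)   # requires internet access
```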
amazon %>%
html_nodes(".s-line-clamp-2") %>%
html_text() -> amazon_titles
head(amazon_titles)
#> [1] "\n \n \n \n\n\n\n\n\n \n \n \n Mastering R for Quantitative Finance\n \n \n \n \n\n\n \n"
#> [2] "\n \n \n \n\n\n\n\n\n \n \n \n R for Data Science: Import, Tidy, Transform, Visualize, and Model Data\n \n \n \n \n\n\n \n"
#> [3] "\n \n \n \n\n\n\n\n\n \n \n \n The Book of R: A First Course in Programming and Statistics\n \n \n \n \n\n\n \n"
#> [4] "\n \n \n \n\n\n\n\n\n \n \n \n R Graphics Cookbook: Practical Recipes for Visualizing Data\n \n \n \n \n\n\n \n"
#> [5] "\n \n \n \n\n\n\n\n\n \n \n \n Discovering Statistics Using R\n \n \n \n \n\n\n \n"
#> [6] "\n \n \n \n\n\n\n\n\n \n \n \n The Programmers Code: A Deep Dive Into Mastering Computer Programming Including Python, C, C++, C#, Html Coding, Raspberry Pi3, And Black Hat Hacking\n \n \n \n \n\n\n \n"
The element pulls in a number of line breaks (\n) and blank spaces, which need to be removed.
Let’s clean this up with str_trim.
amazon_titles <- str_trim(amazon_titles) # Removes leading & trailing whitespace
head(amazon_titles)
#> [1] "Mastering R for Quantitative Finance"
#> [2] "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data"
#> [3] "The Book of R: A First Course in Programming and Statistics"
#> [4] "R Graphics Cookbook: Practical Recipes for Visualizing Data"
#> [5] "Discovering Statistics Using R"
#> [6] "The Programmers Code: A Deep Dive Into Mastering Computer Programming Including Python, C, C++, C#, Html Coding, Raspberry Pi3, And Black Hat Hacking"
This simple function returns cleaned text.
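To show what `str_trim()` does and does not remove, here is a short sketch on a made-up string. `str_squish()` is a related stringr function that is not used in the talk; it is included only for contrast.

```r
library(stringr)

# Made-up string with padding at both ends and a double space inside
raw <- "\n \n   The Book of R:  A First Course \n \n"

trimmed  <- str_trim(raw)    # strips whitespace at both ends only
squished <- str_squish(raw)  # also collapses repeated inner whitespace
```

`str_trim()` is enough for the titles above because the stray whitespace sits only at the ends of each string.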
amazon %>%
html_nodes("a.a-size-base.a-link-normal.a-text-bold") %>%
html_text() -> amazon_format
head(amazon_format)
#> [1] "\n \n \n \n Paperback\n \n \n"
#> [2] "\n \n \n \n Paperback\n \n \n"
#> [3] "\n \n \n \n Kindle\n \n \n"
#> [4] "\n \n \n \n Paperback\n \n \n"
#> [5] "\n \n \n \n eTextbook\n \n \n"
#> [6] "\n \n \n \n Paperback\n \n \n"
The price structure splits price into two elements. We must pull each and combine them into a single price.
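The price code from the slides is not reproduced here. One way to combine the two parts can be sketched as follows, using made-up markup; the class names `price-whole` and `price-fraction` are assumptions and are not the selectors from the talk.

```r
library(rvest)

# Made-up markup; the class names are assumptions, not taken from the talk
page <- read_html('
  <div>
    <span class="price-whole">45</span><span class="price-fraction">39</span>
    <span class="price-whole">25</span><span class="price-fraction">00</span>
  </div>')

whole    <- page %>% html_nodes(".price-whole") %>% html_text()
fraction <- page %>% html_nodes(".price-fraction") %>% html_text()

# Paste the two parts around a decimal point and convert to numeric
price <- as.numeric(paste0(whole, ".", fraction))
```

This only lines up when every item has both parts; a missing fraction would shift later prices onto the wrong rows.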
The ratings element is messier, and we’ll need several cleaning steps.
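As a rough sketch of what those cleaning steps can look like, the block below works on one made-up string shaped like the scraped element: trim it, then pull the star rating from the front and the number of ratings from the end. The regular expressions are an assumption about the layout, not the code from the talk.

```r
library(stringr)

# Made-up string shaped like the scraped element
raw <- "\n\n 4.7 out of 5 stars\n \n\n \n 1,426\n \n"

trimmed <- str_trim(raw)

# Rating: the number in front of "out of 5 stars"
rating <- as.numeric(str_extract(trimmed, "^[0-9.]+"))

# Number of ratings: the digits at the end, thousands separators dropped
n_ratings <- as.numeric(str_remove_all(str_extract(trimmed, "[0-9,]+$"), ","))
```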
amazon %>%
html_nodes("div.a-row.a-size-small") %>%
html_text() -> amazon_rate_n
head(amazon_rate_n)
#> [1] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 3.3 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 13\n \n \n \n \n\n\n\n"
#> [2] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 426\n \n \n \n \n\n\n\n"
#> [3] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.3 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 75\n \n \n \n \n\n\n\n"
#> [4] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14\n \n \n \n \n\n\n\n"
#> [5] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.5 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 255\n \n \n \n \n\n\n\n"
#> [6] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 5.0 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 1\n \n \n \n \n\n\n\n"
amazon_rate_n <- str_trim(amazon_rate_n) # trim leading & trailing \n and ' '
head(amazon_rate_n)
#> [1] "3.3 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 13"
#> [2] "4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 426"
#> [3] "4.3 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 75"
#> [4] "4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14"
#> [5] "4.5 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 255"
#> [6] "5.0 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 1"
Let’s assemble the file!
A common issue with scraping is that missing data elements leave you with an uneven number of records across fields.
We can fix this!
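Before fixing anything, it helps to see which fields are short. A minimal sketch in base R, with made-up vectors standing in for the scraped columns:

```r
# Made-up vectors standing in for scraped columns
titles  <- c("Book A", "Book B", "Book C")
formats <- c("Paperback", "Kindle", "Paperback", "Kindle")
ratings <- c(4.7, 4.3)

# Compare lengths before building a data frame
col_lengths <- lengths(list(titles = titles, formats = formats, ratings = ratings))
col_lengths

# TRUE only when every column has the same number of records
all_equal <- length(unique(col_lengths)) == 1
```

`tibble()` raises an error when its columns differ in length (other than length one), so a check like this shows where the mismatch is before that call fails.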
All titles were populated and scraped accurately. However, due to multiple formats, these records must be repeated to fill the dataframe.
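The repetition can be illustrated on a tiny made-up vector. `append()` inserts a value after a given position; working from the highest index down keeps the lower positions unchanged while you go. The `rep()` alternative and its per-title counts are a hypothetical shortcut, not the approach used in the talk.

```r
titles <- c("Book A", "Book B", "Book C")

# Duplicate "Book C" and then "Book A" with append(), working from the
# highest index down so earlier positions are not shifted
step1    <- append(titles, values = titles[3], after = 3)
expanded <- append(step1, values = titles[1], after = 1)

# The same result with rep(): one count per title (hypothetical counts)
n_formats    <- c(2, 1, 2)
expanded_rep <- rep(titles, times = n_formats)
```

With a vector of format counts, `rep()` replaces the long chain of `append()` calls, but the counts still have to be determined by inspecting the page.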
amazon_titles %>%
append(values = amazon_titles[17], after = 17) %>% # R Companion
append(values = amazon_titles[16], after = 16) %>% # R for Dummies
append(values = amazon_titles[15], after = 15) %>% # Linear Models
append(values = amazon_titles[14], after = 14) %>% # R Cookbook
append(values = amazon_titles[13], after = 13) %>% # Intro to Stat Learn
append(values = amazon_titles[12], after = 12) %>% # Learning R
append(values = amazon_titles[11], after = 11) %>% # Baseball with R
append(values = amazon_titles[11], after = 11) %>% # Baseball with R
append(values = amazon_titles[10], after = 10) %>% # Stats with R
append(values = amazon_titles[9], after = 9) %>% # Hands-On R
append(values = amazon_titles[8], after = 8) %>% # Advanced R
append(values = amazon_titles[7], after = 7) %>% # Advanced R
append(values = amazon_titles[6], after = 6) %>% # Interactive Shiny
append(values = amazon_titles[6], after = 6) %>% # Interactive Shiny
append(values = amazon_titles[5], after = 5) %>% # GLM
append(values = amazon_titles[4], after = 4) %>% # Discovering Stats
append(values = amazon_titles[4], after = 4) %>% # Discovering Stats
append(values = amazon_titles[3], after = 3) %>% # R Graphics
append(values = amazon_titles[2], after = 2) %>% # Book of R
append(values = amazon_titles[1], after = 1) -> amazon_titles # R4DS
length(amazon_titles)
#> [1] 39
Nothing needed here!
Or here!
Some books do not have ratings. A book only has one rating even if it has multiple formats.
For example, the 6th and 9th books do not have ratings.
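The code that fills in the missing ratings is not reproduced in this text. One way to do it, sketched on made-up numbers, is to insert `NA` at each missing position with `append()`:

```r
# Made-up ratings: 9 books on the page, but only 7 ratings were scraped
ratings <- c(4.7, 4.3, 4.5, 4.1, 4.4, 4.8, 4.0)

# Insert NA where a book has no rating, lowest position first so each
# 'after' value refers to the vector as it stands at that step
ratings <- append(ratings, values = NA, after = 5)  # 6th book
ratings <- append(ratings, values = NA, after = 8)  # 9th book
```

After both inserts the vector has one entry per book, with `NA` in positions 6 and 9.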
We must also account for multiple formats.
Like titles, the ratings need to be repeated to show on the correct row.
The same corrections are done here.
amazon_rating %>%
append(values = amazon_rating[17], after = 17) %>% # R Companion
append(values = amazon_rating[16], after = 16) %>% # R for Dummies
append(values = amazon_rating[15], after = 15) %>% # Linear Models
append(values = amazon_rating[14], after = 14) %>% # R Cookbook
append(values = amazon_rating[13], after = 13) %>% # Intro to Stat Learn
append(values = amazon_rating[12], after = 12) %>% # Learning R
append(values = amazon_rating[11], after = 11) %>% # Baseball with R
append(values = amazon_rating[11], after = 11) %>% # Baseball with R
append(values = amazon_rating[10], after = 10) %>% # Stats with R
append(values = amazon_rating[9], after = 9) %>% # Hands-On R
append(values = amazon_rating[8], after = 8) %>% # Advanced R
append(values = amazon_rating[7], after = 7) %>% # Advanced R
append(values = amazon_rating[6], after = 6) %>% # Interactive Shiny
append(values = amazon_rating[6], after = 6) %>% # Interactive Shiny
append(values = amazon_rating[5], after = 5) %>% # GLM
append(values = amazon_rating[4], after = 4) %>% # Discovering Stats
append(values = amazon_rating[4], after = 4) %>% # Discovering Stats
append(values = amazon_rating[3], after = 3) %>% # R Graphics
append(values = amazon_rating[2], after = 2) %>% # Book of R
append(values = amazon_rating[1], after = 1) -> amazon_rating # R4DS
length(amazon_rating)
#> [1] 39
Not all titles have a rating, specifically 5 and 7.
We must also account for multiple formats.
amazon_rate_n %>%
append(values = amazon_rate_n[17], after = 17) %>% # R Companion
append(values = amazon_rate_n[16], after = 16) %>% # R for Dummies
append(values = amazon_rate_n[15], after = 15) %>% # Linear Models
append(values = amazon_rate_n[14], after = 14) %>% # R Cookbook
append(values = amazon_rate_n[13], after = 13) %>% # Intro to Stat Learn
append(values = amazon_rate_n[12], after = 12) %>% # Learning R
append(values = amazon_rate_n[11], after = 11) %>% # Baseball with R
append(values = amazon_rate_n[11], after = 11) %>% # Baseball with R
append(values = amazon_rate_n[10], after = 10) %>% # Stats with R
append(values = amazon_rate_n[9], after = 9) %>% # Hands-On R
append(values = amazon_rate_n[8], after = 8) %>% # Advanced R
append(values = amazon_rate_n[7], after = 7) %>% # Advanced R
append(values = amazon_rate_n[6], after = 6) %>% # Interactive Shiny
append(values = amazon_rate_n[6], after = 6) %>% # Interactive Shiny
append(values = amazon_rate_n[5], after = 5) %>% # GLM
append(values = amazon_rate_n[4], after = 4) %>% # Discovering Stats
append(values = amazon_rate_n[4], after = 4) %>% # Discovering Stats
append(values = amazon_rate_n[3], after = 3) %>% # R Graphics
append(values = amazon_rate_n[2], after = 2) %>% # Book of R
append(values = amazon_rate_n[1], after = 1) -> amazon_rate_n # R4DS
length(amazon_rate_n)
#> [1] 39
Create extra rows due to multiple book formats.
amazon_pub_dt %>%
append(values = amazon_pub_dt[17], after = 17) %>% # R Companion
append(values = amazon_pub_dt[16], after = 16) %>% # R for Dummies
append(values = amazon_pub_dt[15], after = 15) %>% # Linear Models
append(values = amazon_pub_dt[14], after = 14) %>% # R Cookbook
append(values = amazon_pub_dt[13], after = 13) %>% # Intro to Stat Learn
append(values = amazon_pub_dt[12], after = 12) %>% # Learning R
append(values = amazon_pub_dt[11], after = 11) %>% # Baseball with R
append(values = amazon_pub_dt[11], after = 11) %>% # Baseball with R
append(values = amazon_pub_dt[10], after = 10) %>% # Stats with R
append(values = amazon_pub_dt[9], after = 9) %>% # Hands-On R
append(values = amazon_pub_dt[8], after = 8) %>% # Advanced R
append(values = amazon_pub_dt[7], after = 7) %>% # Advanced R
append(values = amazon_pub_dt[6], after = 6) %>% # Interactive Shiny
append(values = amazon_pub_dt[6], after = 6) %>% # Interactive Shiny
append(values = amazon_pub_dt[5], after = 5) %>% # GLM
append(values = amazon_pub_dt[4], after = 4) %>% # Discovering Stats
append(values = amazon_pub_dt[4], after = 4) %>% # Discovering Stats
append(values = amazon_pub_dt[3], after = 3) %>% # R Graphics
append(values = amazon_pub_dt[2], after = 2) %>% # Book of R
append(values = amazon_pub_dt[1], after = 1) -> amazon_pub_dt # R4DS
length(amazon_pub_dt)
#> [1] 39
r_books <- tibble(title = amazon_titles,
                  text_format = amazon_format,
                  price = amazon_price,
                  rating = amazon_rating,
                  num_ratings = amazon_rate_n,
                  publication_date = amazon_pub_dt)
head(r_books)
#> # A tibble: 6 x 6
#> title text_format price rating num_ratings publication_date
#> <chr> <chr> <dbl> <dbl> <dbl> <date>
#> 1 Mastering R for Quantit~ Paperback 45.4 3.3 13 2015-03-10
#> 2 Mastering R for Quantit~ Paperback 39.5 3.3 13 2015-03-10
#> 3 R for Data Science: Imp~ Kindle 25.0 4.7 426 2017-01-10
#> 4 R for Data Science: Imp~ Paperback 33.0 4.7 426 2017-01-10
#> 5 The Book of R: A First ~ eTextbook 30.0 4.3 75 2016-07-16
#> 6 The Book of R: A First ~ Paperback 23.9 4.3 75 2016-07-16
Web Scraping in R & rvest repo
This talk is freely distributed under the MIT License.